-
-
-
- PERLRE(1) 7/Aug/98 (perl 5.005, patch 02) PERLRE(1)
-
-
-
- NAME
- perlre - Perl regular expressions
-
- DESCRIPTION
- This page describes the syntax of regular expressions in
- Perl. For a description of how to use regular expressions
- in matching operations, plus various examples of the same,
- see the discussion of m//, s///, qr// and ?? in the section on
- Regexp Quote-Like Operators in the perlop manpage.
-
- The matching operations can have various modifiers. The
- modifiers that relate to the interpretation of the regular
- expression inside are listed below. For the modifiers that
- alter the way a regular expression is used by Perl, see the
- section on Regexp Quote-Like Operators in the perlop manpage
- and the section on Gory details of parsing quoted constructs
- in the perlop manpage.
-
- i Do case-insensitive pattern matching.
-
- If use locale is in effect, the case map is taken from
- the current locale. See the perllocale manpage.
-
- m Treat string as multiple lines. That is, change "^" and
- "$" from matching at only the very start or end of the
- string to the start or end of any line anywhere within
- the string.
-
- s Treat string as single line. That is, change "." to
- match any character whatsoever, even a newline, which it
- normally would not match.
-
- The /s and /m modifiers both override the $* setting.
- That is, no matter what $* contains, /s without /m will
- force "^" to match only at the beginning of the string
- and "$" to match only at the end (or just before a
- newline at the end) of the string. Together, as /ms,
- they let the "." match any character whatsoever, while
- yet allowing "^" and "$" to match, respectively, just
- after and just before newlines within the string.
-
- x Extend your pattern's legibility by permitting
- whitespace and comments.
-
- These are usually written as "the /x modifier", even though
- the delimiter in question might not actually be a slash. In
- fact, any of these modifiers may also be embedded within the
- regular expression itself using the new (?...) construct.
- See below.
-
- The /x modifier itself needs a little more explanation. It
- tells the regular expression parser to ignore whitespace
- that is neither backslashed nor within a character class.
- You can use this to break up your regular expression into
- (slightly) more readable parts. The # character is also
- treated as a metacharacter introducing a comment, just as in
- ordinary Perl code. This also means that if you want real
- whitespace or # characters in the pattern (outside of a
- character class, where they are unaffected by /x), that
- you'll either have to escape them or encode them using octal
- or hex escapes. Taken together, these features go a long
- way towards making Perl's regular expressions more readable.
- Note that you have to be careful not to include the pattern
- delimiter in the comment--perl has no way of knowing you did
- not intend to close the pattern early. See the C-comment
- deletion code in the perlop manpage.
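
The effect of /x can be seen in a small sketch. The pattern, the variable names and the sample string below are illustrative assumptions, not part of the original page; braces are used as the delimiter so that a slash in a comment cannot end the pattern early.

```perl
use strict;
use warnings;

# A hypothetical pattern for times such as "09:15:30 PM". Under x the
# whitespace and the comments inside the pattern are ignored.
my $time_re = qr{
    (\d\d) :      # hours, then a literal colon
    (\d\d) :      # minutes, then a literal colon
    (\d\d)        # seconds
    [ ] (AM|PM)   # a literal space has to sit in a class or be escaped
}x;

my @parts = "at 09:15:30 PM" =~ $time_re;
print "@parts\n";   # 09 15 30 PM
```

Note that the comments avoid "$" and "@", since those would still be interpolated inside the pattern.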
-
- Regular Expressions
-
- The patterns used in pattern matching are regular
- expressions such as those supplied in the Version 8 regex
- routines. (In fact, the routines are derived (distantly)
- from Henry Spencer's freely redistributable reimplementation
- of the V8 routines.) See the section on Version 8 Regular
- Expressions for details.
-
- In particular the following metacharacters have their
- standard egrep-ish meanings:
-
- \ Quote the next metacharacter
- ^ Match the beginning of the line
- . Match any character (except newline)
- $ Match the end of the line (or before newline at the end)
- | Alternation
- () Grouping
- [] Character class
-
- By default, the "^" character is guaranteed to match at only
- the beginning of the string, the "$" character at only the
- end (or before the newline at the end) and Perl does certain
- optimizations with the assumption that the string contains
- only one line. Embedded newlines will not be matched by "^"
- or "$". You may, however, wish to treat a string as a
- multi-line buffer, such that the "^" will match after any
- newline within the string, and "$" will match before any
- newline. At the cost of a little more overhead, you can do
- this by using the /m modifier on the pattern match operator.
- (Older programs did this by setting $*, but this practice is
- now deprecated.)
-
- To facilitate multi-line substitutions, the "." character
- never matches a newline unless you use the /s modifier,
- which in effect tells Perl to pretend the string is a single
- line--even if it isn't. The /s modifier also overrides the
- setting of $*, in case you have some (badly behaved) older
- code that sets it in another module.
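
As a rough illustration of the /m and /s behaviour described above (the sample text and variable names are assumptions made for this sketch):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without m, "^" matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;       # ("first")
# With m, "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;      # ("first", "second")

# Without s, "." stops at a newline; with s it runs across it.
my ($short) = $text =~ /(first.*)/;   # "first line"
my ($long)  = $text =~ /(first.*)/s;  # the whole string

print scalar(@plain), " ", scalar(@multi), "\n";   # 1 2
print length($short), " ", length($long), "\n";    # 10 23
```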
-
- The following standard quantifiers are recognized:
-
- * Match 0 or more times
- + Match 1 or more times
- ? Match 1 or 0 times
- {n} Match exactly n times
- {n,} Match at least n times
- {n,m} Match at least n but not more than m times
-
- (If a curly bracket occurs in any other context, it is
- treated as a regular character.) The "*" quantifier is
- equivalent to {0,}, the "+" quantifier to {1,}, and the "?"
- quantifier to {0,1}. n and m are limited to integral values
- less than 65536.
-
- By default, a quantified subpattern is "greedy", that is, it
- will match as many times as possible (given a particular
- starting location) while still allowing the rest of the
- pattern to match. If you want it to match the minimum
- number of times possible, follow the quantifier with a "?".
- Note that the meanings don't change, just the "greediness":
-
- *? Match 0 or more times
- +? Match 1 or more times
- ?? Match 0 or 1 time
- {n}? Match exactly n times
- {n,}? Match at least n times
- {n,m}? Match at least n but not more than m times
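
A minimal sketch of the difference between a greedy and a minimal quantifier, using an assumed sample string:

```perl
use strict;
use warnings;

my $s = "aaaa";

my ($greedy)  = $s =~ /(a{2,3})/;    # takes as many as allowed: "aaa"
my ($minimal) = $s =~ /(a{2,3}?)/;   # takes as few as allowed:  "aa"
my ($star)    = $s =~ /(a*?)/;       # zero repetitions suffice: ""

print "greedy=<$greedy> minimal=<$minimal> star=<$star>\n";
# greedy=<aaa> minimal=<aa> star=<>
```

The set of strings each pattern can match is unchanged; only the match that is preferred differs.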
-
- Because patterns are processed as double quoted strings, the
- following also work:
-
- \t tab (HT, TAB)
- \n newline (LF, NL)
- \r return (CR)
- \f form feed (FF)
- \a alarm (bell) (BEL)
- \e escape (think troff) (ESC)
- \033 octal char (think of a PDP-11)
- \x1B hex char
- \c[ control char
- \l lowercase next char (think vi)
- \u uppercase next char (think vi)
- \L lowercase till \E (think vi)
- \U uppercase till \E (think vi)
- \E end case modification (think vi)
- \Q quote (disable) pattern metacharacters till \E
-
- If use locale is in effect, the case map used by \l, \L, \u
- and \U is taken from the current locale. See the perllocale
- manpage.
-
- You cannot include a literal $ or @ within a \Q sequence.
- An unescaped $ or @ interpolates the corresponding variable,
- while escaping will cause the literal string \$ to be
- matched. You'll need to write something like
- m/\Quser\E\@\Qhost/.
-
- In addition, Perl defines the following:
-
- \w Match a "word" character (alphanumeric plus "_")
- \W Match a non-word character
- \s Match a whitespace character
- \S Match a non-whitespace character
- \d Match a digit character
- \D Match a non-digit character
-
- A \w matches a single alphanumeric character, not a whole
- word. To match a word you'd need to say \w+. If use locale
- is in effect, the list of alphabetic characters generated by
- \w is taken from the current locale. See the perllocale
- manpage. You may use \w, \W, \s, \S, \d, and \D within
- character classes (though not as either end of a range).
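
The following sketch shows \w, \s and \d at work, both on their own and inside a character class. The sample line and variable names are assumptions.

```perl
use strict;
use warnings;

my $line = "order_id  42 v1.2";

my @words  = $line =~ /(\w+)/g;       # \w+ matches runs of word characters
my ($num)  = $line =~ /\s(\d+)\s/;    # digits with whitespace on both sides
my @tokens = $line =~ /([\w.]+)/g;    # \w used inside a character class

print join("|", @words), "\n";    # order_id|42|v1|2
print "$num\n";                   # 42
print join("|", @tokens), "\n";   # order_id|42|v1.2
```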
-
- Perl defines the following zero-width assertions:
-
- \b Match a word boundary
- \B Match a non-(word boundary)
- \A Match only at beginning of string
- \Z Match only at end of string, or before newline at the end
- \z Match only at end of string
- \G Match only where previous m//g left off (works only with /g)
-
- A word boundary (\b) is defined as a spot between two
- characters that has a \w on one side of it and a \W on the
- other side of it (in either order), counting the imaginary
- characters off the beginning and end of the string as
- matching a \W. (Within character classes \b represents
- backspace rather than a word boundary.) The \A and \Z are
- just like "^" and "$", except that they won't match multiple
- times when the /m modifier is used, while "^" and "$" will
- match at every internal line boundary. To match the actual
- end of the string, not ignoring newline, you can use \z.
- The \G assertion can be used to chain global matches (using
- m//g), as described in the section on Regexp Quote-Like
- Operators in the perlop manpage.
-
- It is also useful when writing lex-like scanners, when you
- have several patterns that you want to match against
- consecutive substrings of your string; see the previous
- reference. The actual location where \G will match can also
- be influenced by using pos() as an lvalue. See the pos
- entry in the perlfunc manpage.
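
The difference between \Z and \z, and the use of \G in a lex-like scanner, might be sketched as follows. The token names and the input are hypothetical; the /c modifier (see the perlop manpage) is assumed so that a failed match does not reset pos().

```perl
use strict;
use warnings;

my $s = "abc\n";
my $Z = ($s =~ /abc\Z/) ? 1 : 0;   # 1: \Z tolerates a trailing newline
my $z = ($s =~ /abc\z/) ? 1 : 0;   # 0: \z insists on the true end

# A tiny scanner: each m//gc resumes where the previous match stopped.
my $input = "x=42;";
my @tokens;
pos($input) = 0;
while (pos($input) < length($input)) {
    if    ($input =~ /\G([a-z]+)/gc) { push @tokens, "NAME:$1"  }
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "NUM:$1"   }
    elsif ($input =~ /\G(.)/gcs)     { push @tokens, "PUNCT:$1" }
}

print "$Z $z @tokens\n";   # 1 0 NAME:x PUNCT:= NUM:42 PUNCT:;
```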
-
- When the bracketing construct ( ... ) is used, \<digit>
- matches the digit'th substring. Outside of the pattern,
- always use "$" instead of "\" in front of the digit. (While
- the \<digit> notation can on rare occasion work outside the
- current pattern, this should not be relied upon. See the
- WARNING below.) The scope of $<digit> (and $`, $&, and $')
- extends to the end of the enclosing BLOCK or eval string, or
- to the next successful pattern match, whichever comes first.
- If you want to use parentheses to delimit a subpattern
- (e.g., a set of alternatives) without saving it as a
- subpattern, follow the ( with a ?:.
-
- You may have as many parentheses as you wish. If you have
- more than 9 substrings, the variables $10, $11, ... refer to
- the corresponding substring. Within the pattern, \10, \11,
- etc. refer back to substrings if there have been at least
- that many left parentheses before the backreference.
- Otherwise (for backward compatibility) \10 is the same as
- \010, a backspace, and \11 the same as \011, a tab. And so
- on. (\1 through \9 are always backreferences.)
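
A short sketch of a backreference inside a pattern, and of (?:...) grouping without capturing. The sample strings are assumptions.

```perl
use strict;
use warnings;

my $text = "this is is a test";

# \1 inside the pattern refers to what the first group matched;
# $1 is the way to reach the same text after the match.
my $doubled = '';
if ($text =~ /\b(\w+)\s+\1\b/) {
    $doubled = $1;
}
print "doubled word: $doubled\n";   # doubled word: is

# (?: ... ) groups without saving, so (is|was) is still group 1 here.
my ($verb) = "it was fine" =~ /(?:it|this)\s+(is|was)/;
print "$verb\n";                    # was
```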
-
- $+ returns whatever the last bracket match matched. $&
- returns the entire matched string. ($0 used to return the
- same thing, but not any more.) $` returns everything before
- the matched string. $' returns everything after the matched
- string. Examples:
-
- s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
-
- if (/Time: (..):(..):(..)/) {
- $hours = $1;
- $minutes = $2;
- $seconds = $3;
- }
-
- Once perl sees that you need one of $&, $` or $' anywhere in
- the program, it has to provide them on each and every
- pattern match. This can slow your program down. The same
- mechanism that handles these provides for the use of $1, $2,
- etc., so you pay the same price for each pattern that
- contains capturing parentheses. But if you never use $&,
- etc., in your script, then patterns without capturing
- parentheses won't be penalized. So avoid $&, $', and $` if
- you can, but if you can't (and some algorithms really
- appreciate them), once you've used them once, use them at
- will, because you've already paid the price. As of 5.005,
- $& is not so costly as the other two.
-
- Backslashed metacharacters in Perl are alphanumeric, such as
- \b, \w, \n. Unlike some other regular expression languages,
- there are no backslashed symbols that aren't alphanumeric.
- So anything that looks like \\, \(, \), \<, \>, \{, or \} is
- always interpreted as a literal character, not a
- metacharacter. This was once used in a common idiom to
- disable or quote the special meanings of regular expression
- metacharacters in a string that you want to use for a
- pattern. Simply quote all non-alphanumeric characters:
-
- $pattern =~ s/(\W)/\\$1/g;
-
- Now it is much more common to see either the quotemeta()
- function or the \Q escape sequence used to disable all
- metacharacters' special meanings like this:
-
- /$unquoted\Q$quoted\E$unquoted/
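
As an illustration, here is a sketch comparing an interpolated string with and without \Q, plus quotemeta(). The string and the variable names are assumptions.

```perl
use strict;
use warnings;

my $literal = "a.b*c";    # contains two metacharacters

my $raw    = ("aXbbbc" =~ /^$literal$/)     ? 1 : 0;  # 1: "." and "*" are live
my $quoted = ("aXbbbc" =~ /^\Q$literal\E$/) ? 1 : 0;  # 0: taken literally
my $exact  = ("a.b*c"  =~ /^\Q$literal\E$/) ? 1 : 0;  # 1

# quotemeta() applies the same escaping as \Q ... \E.
my $escaped = quotemeta($literal);

print "$raw $quoted $exact $escaped\n";   # 1 0 1 a\.b\*c
```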
-
- Perl defines a consistent extension syntax for regular
- expressions. The syntax is a pair of parentheses with a
- question mark as the first thing within the parentheses
- (this was a syntax error in older versions of Perl). The
- character after the question mark gives the function of the
- extension. Several extensions are already supported:
-
- (?#text) A comment. The text is ignored. If the /x switch
- is used to enable whitespace formatting, a simple
- # will suffice. Note that perl closes the comment
- as soon as it sees a ), so there is no way to put
- a literal ) in the comment.
-
- (?:pattern)
-
- (?imsx-imsx:pattern)
- This is for clustering, not capturing; it groups
- subexpressions like "()", but doesn't make
- backreferences as "()" does. So
-
- @fields = split(/\b(?:a|b|c)\b/)
-
- is like
-
- @fields = split(/\b(a|b|c)\b/)
-
- but doesn't spit out extra fields.
-
- The letters between ? and : act as flag
- modifiers; see the (?imsx-imsx) entry below. In
- particular,
-
- /(?s-i:more.*than).*million/i
-
- is equivalent to the more verbose
-
- /(?:(?s-i)more.*than).*million/i
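
A sketch of both points made in this entry: the extra fields that capturing parentheses add to split, and flags that apply only inside a (?imsx-imsx:pattern) group. The sample strings are assumptions.

```perl
use strict;
use warnings;

my $s = "1 a 2 b 3";

# Capturing parentheses make split return the separators too.
my @with    = split /\s*(a|b)\s*/, $s;     # (1, "a", 2, "b", 3)
# Clustering parentheses group the alternatives but return only the fields.
my @without = split /\s*(?:a|b)\s*/, $s;   # (1, 2, 3)

# Inside the group, s is on and i is off; outside it the trailing i applies.
my $upper = ("MORE\nthan a million" =~ /(?s-i:more.*than).*million/i) ? 1 : 0;
my $lower = ("more\nthan a MILLION" =~ /(?s-i:more.*than).*million/i) ? 1 : 0;

print scalar(@with), " ", scalar(@without), " $upper $lower\n";   # 5 3 0 1
```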
-
-
- (?=pattern)
- A zero-width positive lookahead assertion. For
- example, /\w+(?=\t)/ matches a word followed by a
- tab, without including the tab in $&.
-
- (?!pattern)
- A zero-width negative lookahead assertion. For
- example /foo(?!bar)/ matches any occurrence of
- "foo" that isn't followed by "bar". Note however
- that lookahead and lookbehind are NOT the same
- thing. You cannot use this for lookbehind.
-
- If you are looking for a "bar" that isn't preceded
- by a "foo", /(?!foo)bar/ will not do what you
- want. That's because the (?!foo) is just saying
- that the next thing cannot be "foo"--and it's not,
- it's a "bar", so "foobar" will match. You would
- have to do something like /(?!foo)...bar/ for
- that. We say "like" because there's the case of
- your "bar" not having three characters before it.
- You could cover that this way:
- /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still
- easier just to say:
-
- if (/bar/ && $` !~ /foo$/)
-
- For lookbehind see below.
-
- (?<=pattern)
- A zero-width positive lookbehind assertion. For
- example, /(?<=\t)\w+/ matches a word following a
- tab, without including the tab in $&. Works only
- for fixed-width lookbehind.
-
- (?<!pattern)
- A zero-width negative lookbehind assertion. For
- example /(?<!bar)foo/ matches any occurrence of
- "foo" that isn't following "bar". Works only for
- fixed-width lookbehind.
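
The four lookaround assertions can be compared side by side in a small sketch. The sample string is an assumption, chosen so that "foo" appears before "bar", before "baz", and after "bar".

```perl
use strict;
use warnings;

my $s = "foobar foobaz barfoo";

my $ahead      = () = $s =~ /foo(?=bar)/g;    # "foo" followed by "bar"
my $not_ahead  = () = $s =~ /foo(?!bar)/g;    # "foo" not followed by "bar"
my $behind     = () = $s =~ /(?<=bar)foo/g;   # "foo" preceded by "bar"
my $not_behind = () = $s =~ /(?<!bar)foo/g;   # "foo" not preceded by "bar"

print "$ahead $not_ahead $behind $not_behind\n";   # 1 2 1 2
```

Each count is the number of matches found by the /g match in list context.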
-
- (?{ code })
- Experimental "evaluate any Perl code" zero-width
- assertion. Always succeeds. code is not
- interpolated. Currently the rules to determine
- where the code ends are somewhat convoluted.
-
- The code is properly scoped in the following
- sense: if the assertion is backtracked (compare
- the section on Backtracking), all the changes
- introduced after localisation are undone, so
-
- $_ = 'a' x 8;
- m<
- (?{ $cnt = 0 }) # Initialize $cnt.
- (
- a
- (?{
- local $cnt = $cnt + 1; # Update $cnt, backtracking-safe.
- })
- )*
- aaaa
- (?{ $res = $cnt }) # On success copy to non-localized
- # location.
- >x;
-
- will set $res = 4. Note that after the match $cnt
- returns to the globally introduced value 0, since
- the scopes which restrict local statements are
- unwound.
-
- This assertion may be used as the condition of a
- (?(condition)yes-pattern|no-pattern) switch. If not used in this way, the
- result of evaluation of code is put into variable
- $^R. This happens immediately, so $^R can be used
- from other (?{ code }) assertions inside the same
- regular expression.
-
- The above assignment to $^R is properly localized,
- thus the old value of $^R is restored if the
- assertion is backtracked (compare the section on
- Backtracking).
-
- Due to security concerns, this construction is not
- allowed if the regular expression involves run-time
- interpolation of variables, unless the use re
- 'eval' pragma is in effect (see the re manpage), or the
- variables contain results of the qr() operator (see
- the section on qr/STRING/imosx in the perlop
- manpage).
-
- This restriction is due to the wide-spread
- (questionable) practice of using the construct
-
- $re = <>;
- chomp $re;
- $string =~ /$re/;
-
- without tainting. While this code is frowned upon
- from a security point of view, when (?{}) was
- introduced, it was considered bad to add new
- security holes to existing scripts.
-
- NOTE: Use of the above insecure snippet without
- also enabling taint mode is to be severely frowned
- upon. use re 'eval' does not disable tainting
- checks, thus to allow $re in the above snippet to
- contain (?{}) with tainting enabled, one needs
- both to use re 'eval' and to untaint $re.
-
- (?>pattern)
- An "independent" subexpression. Matches the
- substring that a standalone pattern would match if
- anchored at the given position, and only this
- substring.
-
- Say, ^(?>a*)ab will never match, since (?>a*)
- (anchored at the beginning of string, as above)
- will match all characters a at the beginning of
- string, leaving no a for ab to match. In
- contrast, a*ab will match the same as a+b, since
- the match of the subgroup a* is influenced by the
- following group ab (see the section on
- Backtracking). In particular, a* inside a*ab will
- match fewer characters than a standalone a*, since
- this makes the tail match.
-
- An effect similar to (?>pattern) may be achieved
- by
-
- (?=(pattern))\1
-
- since the lookahead is in "logical" context, thus
- matches the same substring as a standalone pattern.
- The following \1 eats the matched string, thus
- making a zero-length assertion into an analogue of
- (?>...). (The difference between these two
- constructs is that the second one uses a capturing
- group, thus shifting ordinals of backreferences in
- the rest of a regular expression.)
-
- This construct is useful for optimizations of
- "eternal" matches, because it will not backtrack
- (see the section on Backtracking).
-
- m{ \(
- (
- [^()]+
- |
- \( [^()]* \)
- )+
- \)
- }x
-
- That will efficiently match a nonempty group with
- matching two-or-less-level-deep parentheses.
- However, if there is no such group, it will take
- virtually forever on a long string. That's
- because there are so many different ways to split
- a long string into several substrings. This is
- what (.+)+ is doing, and (.+)+ is similar to a
- subpattern of the above pattern. Consider that
- the above pattern detects no-match on
- ((()aaaaaaaaaaaaaaaaaa in several seconds, but
- that each extra letter doubles this time. This
- exponential performance will make it appear that
- your program has hung.
-
- However, a tiny modification of this pattern
-
- m{ \(
- (
- (?> [^()]+ )
- |
- \( [^()]* \)
- )+
- \)
- }x
-
- which uses (?>...) matches exactly when the one
- above does (verifying this yourself would be a
- productive exercise), but finishes in a fourth of the
- time when used on a similar string with 1000000
- "a"s. Be aware, however, that this pattern
- currently triggers a warning message under -w
- saying it "matches the null string many times".
-
- On simple groups, such as the pattern (?> [^()]+
- ), a comparable effect may be achieved by
- negative lookahead, as in [^()]+ (?! [^()] ).
- This was only 4 times slower on a string with
- 1000000 "a"s.
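
The claim that ^(?>a*)ab never matches, while a*ab does, can be checked with a short sketch. The sample string is an assumption, and the third pattern is the (?=(pattern))\1 analogue mentioned in this entry.

```perl
use strict;
use warnings;

my $s = "aaab";

my $plain  = ($s =~ /^a*ab/)         ? 1 : 0;  # 1: a* gives back one "a"
my $atomic = ($s =~ /^(?>a*)ab/)     ? 1 : 0;  # 0: (?>a*) keeps all three
my $looka  = ($s =~ /^(?=(a*))\1ab/) ? 1 : 0;  # 0: same effect via lookahead

print "$plain $atomic $looka\n";   # 1 0 0
```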
-
- (?(condition)yes-pattern|no-pattern)
-
- (?(condition)yes-pattern)
- Conditional expression. (condition) should be
- either an integer in parentheses (which is valid
- if the corresponding pair of parentheses matched),
- or a lookahead/lookbehind/evaluate zero-width
- assertion.
-
- Say,
-
- m{ ( \( )?
- [^()]+
- (?(1) \) )
- }x
-
- matches a chunk of non-parentheses, possibly
- enclosed in parentheses.
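
To make the behaviour of the conditional easier to observe, the sketch below uses an anchored variant of the pattern above. The anchors and the test strings are assumptions added for the illustration.

```perl
use strict;
use warnings;

# A closing parenthesis is required exactly when group 1 (the opening
# parenthesis) took part in the match.
my $re = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

my @verdicts;
for my $s ("abc", "(abc)", "(abc", "abc)") {
    push @verdicts, ($s =~ $re) ? "yes" : "no";
}

print "@verdicts\n";   # yes yes no no
```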
-
- (?imsx-imsx)
- One or more embedded pattern-match modifiers.
- This is particularly useful for patterns that are
- specified in a table somewhere, some of which want
- to be case sensitive, and some of which don't.
- The case insensitive ones need to include merely
- (?i) at the front of the pattern. For example:
-
- $pattern = "foobar";
- if ( /$pattern/i ) { }
-
- # more flexible:
-
- $pattern = "(?i)foobar";
- if ( /$pattern/ ) { }
-
- Letters after - switch modifiers off.
-
- These modifiers are localized inside an enclosing
- group (if any). Say,
-
- ( (?i) blah ) \s+ \1
-
- (assuming x modifier, and no i modifier outside of
- this group) will match a repeated (including the
- case!) word blah in any case.
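
A sketch of both uses of embedded modifiers: a (?i) that is local to its enclosing group, and a table of patterns in which each entry carries its own flags. The \b anchors, sample strings and variable names are assumptions.

```perl
use strict;
use warnings;

# (?i) is switched off again at the closing parenthesis of its group,
# so \1 must repeat the word with the same capitalisation.
my $re = qr/ \b ( (?i) blah ) \s+ \1 \b /x;

my $same  = ("Blah Blah" =~ $re) ? 1 : 0;   # 1
my $mixed = ("Blah blah" =~ $re) ? 1 : 0;   # 0

# Patterns kept in a table; only the first one is case insensitive.
my @hits;
for my $pattern ('(?i)foobar', 'foobar') {
    push @hits, ("FooBar" =~ /$pattern/) ? 1 : 0;
}

print "$same $mixed @hits\n";   # 1 0 1 0
```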
-
- A question mark was chosen for this and for the new
- minimal-matching construct because 1) question mark is
- pretty rare in older regular expressions, and 2) whenever
- you see one, you should stop and "question" exactly what is
- going on. That's psychology...
-
- Backtracking
-
- A fundamental feature of regular expression matching
- involves the notion called backtracking, which is currently
- used (when needed) by all regular expression quantifiers,
- namely *, *?, +, +?, {n,m}, and {n,m}?.
-
- For a regular expression to match, the entire regular
- expression must match, not just part of it. So if the
- beginning of a pattern containing a quantifier succeeds in a
- way that causes later parts in the pattern to fail, the
- matching engine backs up and recalculates the beginning
- part--that's why it's called backtracking.
-
- Here is an example of backtracking: Let's say you want to
- find the word following "foo" in the string "Food is on the
- foo table.":
-
- $_ = "Food is on the foo table.";
- if ( /\b(foo)\s+(\w+)/i ) {
- print "$2 follows $1.\n";
- }
-
- When the match runs, the first part of the regular
- expression (\b(foo)) finds a possible match right at the
- beginning of the string, and loads up $1 with "Foo".
- However, as soon as the matching engine sees that there's no
- whitespace following the "Foo" that it had saved in $1, it
- realizes its mistake and starts over again one character
- after where it had the tentative match. This time it goes
- all the way until the next occurrence of "foo". The complete
- regular expression matches this time, and you get the
- expected output of "table follows foo."
-
- Sometimes minimal matching can help a lot. Imagine you'd
- like to match everything between "foo" and "bar".
- Initially, you write something like this:
-
- $_ = "The food is under the bar in the barn.";
- if ( /foo(.*)bar/ ) {
- print "got <$1>\n";
- }
-
- Which perhaps unexpectedly yields:
-
- got <d is under the bar in the >
-
- That's because .* was greedy, so you get everything between
- the first "foo" and the last "bar". In this case, it's more
- effective to use minimal matching to make sure you get the
- text between a "foo" and the first "bar" thereafter.
-
- if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
- got <d is under the >
-
- Here's another example: let's say you'd like to match a
- number at the end of a string, and you also want to keep the
- preceding part of the match. So you write this:
-
- $_ = "I have 2 numbers: 53147";
- if ( /(.*)(\d*)/ ) { # Wrong!
- print "Beginning is <$1>, number is <$2>.\n";
- }
-
- That won't work at all, because .* was greedy and gobbled up
- the whole string. As \d* can match on an empty string the
- complete regular expression matched successfully.
-
- Beginning is <I have 2 numbers: 53147>, number is <>.
-
- Here are some variants, most of which don't work:
-
- $_ = "I have 2 numbers: 53147";
- @pats = qw{
- (.*)(\d*)
- (.*)(\d+)
- (.*?)(\d*)
- (.*?)(\d+)
- (.*)(\d+)$
- (.*?)(\d+)$
- (.*)\b(\d+)$
- (.*\D)(\d+)$
- };
-
- for $pat (@pats) {
- printf "%-12s ", $pat;
- if ( /$pat/ ) {
- print "<$1> <$2>\n";
- } else {
- print "FAIL\n";
- }
- }
-
- That will print out:
-
- (.*)(\d*) <I have 2 numbers: 53147> <>
- (.*)(\d+) <I have 2 numbers: 5314> <7>
- (.*?)(\d*) <> <>
- (.*?)(\d+) <I have > <2>
- (.*)(\d+)$ <I have 2 numbers: 5314> <7>
- (.*?)(\d+)$ <I have 2 numbers: > <53147>
- (.*)\b(\d+)$ <I have 2 numbers: > <53147>
- (.*\D)(\d+)$ <I have 2 numbers: > <53147>
-
- As you see, this can be a bit tricky. It's important to
- realize that a regular expression is merely a set of
- assertions that gives a definition of success. There may be
- 0, 1, or several different ways that the definition might
- succeed against a particular string. And if there are
- multiple ways it might succeed, you need to understand
- backtracking to know which variety of success you will
- achieve.
-
- When using lookahead assertions and negations, this can all
- get even trickier. Imagine you'd like to find a sequence of
- non-digits not followed by "123". You might try to write
- that as
-
- $_ = "ABC123";
- if ( /^\D*(?!123)/ ) { # Wrong!
- print "Yup, no 123 in $_\n";
- }
-
- But that isn't going to match; at least, not the way you're
- hoping. It claims that there is no 123 in the string.
- Here's a clearer picture of why that pattern matches,
- contrary to popular expectations:
-
- $x = 'ABC123' ;
- $y = 'ABC445' ;
-
- print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
- print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;
-
- print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
- print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;
-
- This prints
-
- 2: got ABC
- 3: got AB
- 4: got ABC
-
- You might have expected test 3 to fail because it seems to be a
- more general-purpose version of test 1. The important
- difference between them is that test 3 contains a quantifier
- (\D*) and so can use backtracking, whereas test 1 will not.
- What's happening is that you've asked "Is it true that at
- the start of $x, following 0 or more non-digits, you have
- something that's not 123?" If the pattern matcher had let
- \D* expand to "ABC", this would have caused the whole
- pattern to fail. The search engine will initially match \D*
- with "ABC". Then it will try to match (?!123) with "123",
- which of course fails. But because a quantifier (\D*) has
- been used in the regular expression, the search engine can
- backtrack and retry the match differently in the hope of
- matching the complete regular expression.
-
- The pattern really, really wants to succeed, so it uses the
- standard pattern back-off-and-retry and lets \D* expand to
- just "AB" this time. Now there's indeed something following
- "AB" that is not "123". It's in fact "C123", which
- suffices.
-
- We can deal with this by using both an assertion and a
- negation. We'll say that the first part in $1 must be
- followed by a digit, and in fact, it must also be followed
- by something that's not "123". Remember that the lookaheads
- are zero-width expressions--they only look, but don't
- consume any of the string in their match. So rewriting this
- way produces what you'd expect; that is, case 5 will fail,
- but case 6 succeeds:
-
- print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
- print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;
-
- 6: got ABC
-
- In other words, the two zero-width assertions next to each
- other work as though they're ANDed together, just as you'd
- use any builtin assertions: /^$/ matches only if you're at
- the beginning of the line AND the end of the line
- simultaneously. The deeper underlying truth is that
- juxtaposition in regular expressions always means AND,
- except when you write an explicit OR using the vertical bar.
- /ab/ means match "a" AND (then) match "b", although the
- attempted matches are made at different positions because
- "a" is not a zero-width assertion, but a one-width
- assertion.
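
The "juxtaposition means AND" rule can be sketched with two lookaheads placed at the same position. The pattern and the test strings are illustrative assumptions.

```perl
use strict;
use warnings;

# Both lookaheads are tried at the start of the string and both must
# hold: there has to be a digit AND a lowercase letter somewhere.
my $both = qr/^(?=.*\d)(?=.*[a-z])/;

my @results;
for my $s ("abc1", "ABC1", "abc") {
    push @results, ($s =~ $both) ? 1 : 0;
}

print "@results\n";   # 1 0 0
```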
-
- One warning: particularly complicated regular expressions
- can take exponential time to solve due to the immense number
- of possible ways they can use backtracking to try to match.
- For example this will take a very long time to run
-
- /((a{0,5}){0,5}){0,5}/
-
- And if you used *'s instead of limiting it to 0 through 5
- matches, then it would take literally forever--or until you
- ran out of stack space.
-
- A powerful tool for optimizing such beasts is "independent"
- groups, which do not backtrack (see the description of
- (?>pattern)). Note also that zero-length lookahead/lookbehind
- assertions will not backtrack to make the tail match, since
- they are in "logical" context: only whether they match is
- considered relevant. For an example where side-effects of a
- lookahead might have influenced the following match, see the
- description of (?>pattern).
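-
- As a rough illustration of the difference, the sketch below
- runs three patterns against one assumed sample string. Only
- the ordinary group gives back a character so that the tail
- can match:

```perl
use strict;
use warnings;

my $s = 'aaab';   # assumed sample string

# a+ first takes "aaa", then backtracks to "aa" so that "ab" can match.
my $plain = $s =~ /a+ab/ ? 1 : 0;

# (?>a+) never gives back what it matched, so "ab" cannot match.
my $indep = $s =~ /(?>a+)ab/ ? 1 : 0;

# The lookahead is not re-entered either; \1 is stuck with its capture.
my $lookahead = $s =~ /(?=(a+))\1ab/ ? 1 : 0;

print "plain=$plain independent=$indep lookahead=$lookahead\n";
```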
-
- Version 8 Regular Expressions
-
- In case you're not familiar with the "regular" Version 8
- regex routines, here are the pattern-matching rules not
- described above.
-
- Any single character matches itself, unless it is a
- metacharacter with a special meaning described here or
- above. You can cause characters that normally function as
- metacharacters to be interpreted literally by prefixing them
- with a "\" (e.g., "\." matches a ".", not any character;
- "\\" matches a "\"). A series of characters matches that
- series of characters in the target string, so the pattern
- blurfl would match "blurfl" in the target string.
-
- You can specify a character class, by enclosing a list of
- characters in [], which will match any one character from
- the list. If the first character after the "[" is "^", the
- class matches any character not in the list. Within a list,
- the "-" character is used to specify a range, so that a-z
- represents all characters between "a" and "z", inclusive.
- If you want "-" itself to be a member of a class, put it at
- the start or end of the list, or escape it with a backslash.
- (The following all specify the same class of three
- characters: [-az], [az-], and [a\-z]. All are different
- from [a-z], which specifies a class containing twenty-six
- characters.)
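-
- A quick check of the three equivalent spellings against the
- range form (the test strings are chosen only for
- illustration):

```perl
use strict;
use warnings;

# Three spellings of the class containing "-", "a" and "z".
my @three_char = (qr/^[-az]+$/, qr/^[az-]+$/, qr/^[a\-z]+$/);
# The range a through z: twenty-six characters, no "-".
my $range = qr/^[a-z]+$/;

my $all_accept_dash = 1;
my $none_accept_m   = 1;
for my $re (@three_char) {
    $all_accept_dash = 0 unless 'a-z' =~ $re;
    $none_accept_m   = 0 if     'm'   =~ $re;
}

my $range_accepts_m    = 'm' =~ $range ? 1 : 0;
my $range_accepts_dash = '-' =~ $range ? 1 : 0;
```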
-
- Characters may be specified using a metacharacter syntax
- much like that used in C: "\n" matches a newline, "\t" a
- tab, "\r" a carriage return, "\f" a form feed, etc. More
- generally, \nnn, where nnn is a string of octal digits,
- matches the character whose ASCII value is nnn. Similarly,
- \xnn, where nn are hexadecimal digits, matches the character
- whose ASCII value is nn. The expression \cx matches the
- ASCII character control-x. Finally, the "." metacharacter
- matches any character except "\n" (unless you use /s).
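-
- The escapes above can be tried out directly. This is only a
- sketch; the particular characters tested are arbitrary
- choices:

```perl
use strict;
use warnings;

my $octal_tab = "\t"   =~ /\A\011\z/ ? 1 : 0;   # octal 011 is the tab
my $hex_A     = 'A'    =~ /\A\x41\z/ ? 1 : 0;   # hex 41 is "A"
my $ctrl_A    = "\x01" =~ /\A\cA\z/  ? 1 : 0;   # control-A is character 1

# "." refuses a newline unless /s is in effect.
my $dot_plain = "\n" =~ /\A.\z/  ? 1 : 0;
my $dot_s     = "\n" =~ /\A.\z/s ? 1 : 0;
```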
-
- You can specify a series of alternatives for a pattern using
- "|" to separate them, so that fee|fie|foe will match any of
- "fee", "fie", or "foe" in the target string (as would
- f(e|i|o)e). The first alternative includes everything from
- the last pattern delimiter ("(", "[", or the beginning of
- the pattern) up to the first "|", and the last alternative
- contains everything from the last "|" to the next pattern
- delimiter. For this reason, it's common practice to include
- alternatives in parentheses, to minimize confusion about
- where they start and end.
-
- Alternatives are tried from left to right, so the first
- alternative for which the entire expression matches is the
- one that is chosen. This means that alternatives are not
- necessarily greedy. For example, when matching foo|foot
- against "barefoot", only the "foo" part will match, as that
- is the first alternative tried, and it successfully matches
- the target string. (This might not seem important, but it is
- important when you are capturing matched text using
- parentheses.)
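-
- The effect on captured text can be seen in a short sketch
- (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $target = 'barefoot';

# The leftmost alternative that lets the whole pattern match wins.
my ($short_first) = $target =~ /(foo|foot)/;    # "foo"
my ($long_first)  = $target =~ /(foot|foo)/;    # "foot"

# With an anchor after the group, "foo" alone cannot complete the
# match, so the second alternative is used.
my ($anchored)    = $target =~ /(foo|foot)$/;   # "foot"

print "$short_first $long_first $anchored\n";
```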
-
- Also remember that "|" is interpreted as a literal within
- square brackets, so if you write [fee|fie|foe] you're really
- only matching [feio|].
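-
- A minimal check of that pitfall, using arbitrary test
- strings:

```perl
use strict;
use warnings;

# Inside brackets this is just the set of characters f, e, i, o and |.
my $class_takes_bar = '|'   =~ /^[fee|fie|foe]$/ ? 1 : 0;
my $class_takes_fee = 'fee' =~ /^[fee|fie|foe]$/ ? 1 : 0;

# A group is what was probably intended.
my $group_takes_fee = 'fee' =~ /^(?:fee|fie|foe)$/ ? 1 : 0;
```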
-
- Within a pattern, you may designate subpatterns for later
- reference by enclosing them in parentheses, and you may
- refer back to the nth subpattern later in the pattern using
- the metacharacter \n, where n is the subpattern's number
- (\1, \2, and so on). Subpatterns are numbered based on the
- left-to-right order of their opening parentheses. A
- backreference matches whatever actually matched the
- subpattern in the string being examined, not the rules for
- that subpattern. Therefore, (0|0x)\d*\s\1\d* will match
- "0x1234 0x4321", but not "0x1234 01234", because subpattern
- 1 actually matched "0x", even though the rule 0|0x could
- potentially match the leading 0 in the second number.
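-
- The same example, written out as a runnable sketch:

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# Subpattern 1 ends up matching "0x", so \1 requires a literal "0x".
my $same_prefix = '0x1234 0x4321' =~ $re ? $1 : undef;

# Here the second number starts with a bare "0", which \1 cannot accept.
my $mixed_prefix = '0x1234 01234' =~ $re ? 1 : 0;
```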
-
- WARNING on \1 vs $1
-
- Some people get too used to writing things like:
-
- $pattern =~ s/(\W)/\\\1/g;
-
- This is grandfathered for the RHS of a substitute to avoid
- shocking the sed addicts, but it's a dirty habit to get
- into. That's because in PerlThink, the righthand side of a
- s/// is a double-quoted string. \1 in the usual double-
- quoted string means a control-A. The customary Unix meaning
- of \1 is kludged in for s///. However, if you get into the
- habit of doing that, you get yourself into trouble if you
- then add an /e modifier.
-
- s/(\d+)/ \1 + 1 /eg; # causes warning under -w
-
- Or if you try to do
-
- s/(\d+)/\1000/;
-
- You can't disambiguate that by saying \{1}000, whereas you
- can fix it with ${1}000. Basically, the operation of
- interpolation should not be confused with the operation of
- matching a backreference. Certainly they mean two different
- things on the left side of the s///.
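-
- A short sketch of the $1 forms recommended above, using an
- assumed sample string:

```perl
use strict;
use warnings;

# ${1} marks where the variable name ends, so "000" stays literal text.
(my $padded = 'item 7') =~ s/(\d+)/${1}000/;

# Under /e the right-hand side is Perl code, where $1 is an ordinary
# variable and can be used in arithmetic.
(my $bumped = 'item 7') =~ s/(\d+)/$1 + 1/e;

print "$padded\n$bumped\n";
```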
-
- Repeated patterns matching zero-length substring
-
- WARNING: Difficult material (and prose) ahead. This section
- needs a rewrite.
-
- Regular expressions provide a terse and powerful programming
- language. As with most other power tools, power comes
- together with the ability to wreak havoc.
-
- A common abuse of this power stems from the ability to make
- infinite loops using regular expressions, with something as
- innocuous as:
-
- 'foo' =~ m{ ( o? )* }x;
-
- The o? can match at the beginning of 'foo', and since the
- position in the string is not moved by the match, o? would
- match again and again due to the * modifier. Another common
- way to create a similar cycle is with the looping modifier
- //g:
-
- @matches = ( 'foo' =~ m{ o? }xg );
-
- or
-
- print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
-
- or the loop implied by split().
-
- However, long experience has shown that many programming
- tasks may be significantly simplified by using repeated
- subexpressions which may match zero-length substrings, with
- a simple example being:
-
- @chars = split //, $string; # // is not magic in split
- ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
-
- Thus Perl allows the /()/ construct, which forcefully breaks
- the infinite loop. The rules for this are different for
- lower-level loops given by the greedy modifiers *+{}, and
- for higher-level ones like the /g modifier or split()
- operator.
-
- The lower-level loops are interrupted when Perl detects
- that a repeated expression matched a zero-length
- substring, thus
-
- m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
-
- is made equivalent to
-
- m{ (?: NON_ZERO_LENGTH )*
- |
- (?: ZERO_LENGTH )?
- }x;
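-
- A small sketch of this rule in action; the second test
- string and the variable names are assumptions made for
- illustration:

```perl
use strict;
use warnings;

# Finishes even though o? could match the empty string at "f" forever.
my $terminates = 'foo' =~ m{ ( o? )* }x ? 1 : 0;

# Two one-character iterations, then a zero-length iteration ends the loop.
my ($run) = 'oof' =~ m{ ^ ( (?: o? )* ) }x;

print "terminates=$terminates run=$run\n";
```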
-
- The higher-level loops preserve an additional state between
- iterations: whether the last match was zero-length. To
- break the loop, the match that follows a zero-length match
- is prohibited from having a length of zero. This
- prohibition interacts with backtracking (see the section on
- Backtracking), and so the second-best match is chosen if the
- best match is of zero length.
-
- Say,
-
- $_ = 'bar';
- s/\w??/<$&>/g;
-
- results in "<><b><><a><><r><>". At each position of the
- string the best match given by non-greedy ?? is the
- zero-length match, and the second-best match is what is
- matched by \w. Thus zero-length matches alternate with
- one-character-long matches.
-
- Similarly, for repeated m/()/g the second-best match is the
- match at the position one notch further in the string.
-
- The additional state of being matched with zero-length is
- associated with the matched string, and is reset by each
- assignment to pos().
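-
- The higher-level cases discussed in this section can be
- collected into one sketch. The variable names are
- illustrative only:

```perl
use strict;
use warnings;

# //g in list context: an empty match at "f", one match per "o",
# and a final empty match at the end of the string.
my @matches = ('foo' =~ m{ o? }xg);
my $shown   = join '', map { "<$_>" } @matches;

# Zero-length and one-character matches alternate.
(my $bar = 'bar') =~ s/\w??/<$&>/g;

# Empty patterns visit every position exactly once.
my @chars = split //, 'bar';
(my $spaced = 'bar') =~ s/()/ /g;

print "$shown\n$bar\n@chars\n[$spaced]\n";
```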
-
- Creating custom RE engines
-
- Overloaded constants (see the overload manpage) provide a
- simple way to extend the functionality of the RE engine.
-
- Suppose that we want to enable a new RE escape-sequence \Y|
- which matches at the boundary between white-space characters
- and non-whitespace characters. Note that
- (?=\S)(?<!\S)|(?!\S)(?<=\S) matches exactly at these
- positions, so we want to have each \Y| in the place of the
- more complicated version. We can create a module customre
- to do this:
-
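- package customre;
- use overload;
-
- # Install convert() as the handler for constant parts of regexes.
- sub import {
-     shift;
-     die "No argument to customre::import allowed" if @_;
-     overload::constant 'qr' => \&convert;
- }
-
- sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
-
- # An escaped backslash must stay escaped in the converted pattern.
- my %rules = ( '\\' => '\\\\',
-               'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
-
- # Rewrite each \\ or \Y. sequence via %rules; die on unknown \Y escapes.
- sub convert {
-     my $re = shift;
-     $re =~ s{
-               \\ ( \\ | Y . )
-             }
-             { $rules{$1} or invalid($re,$1) }sgex;
-     return $re;
- }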
-
- Now use customre enables the new escape in constant regular
- expressions, i.e., those without any runtime variable
- interpolations. As documented in the overload manpage, this
- conversion will work only over literal parts of regular
- expressions. For \Y|$re\Y| the variable part of this
- regular expression needs to be converted explicitly (but
- only if the special meaning of \Y| should be enabled inside
- $re):
-
- use customre;
- $re = <>;
- chomp $re;
- $re = customre::convert $re;
- /\Y|$re\Y|/;
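-
- To see where such a boundary falls, the expanded pattern can
- be run on its own, without the customre module. This is a
- sketch with an assumed sample string:

```perl
use strict;
use warnings;

# The expansion that \Y| stands for in the customre example.
my $boundary = qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/;

my $text = 'ab cd';   # assumed sample string
my @positions;

# Every match is zero-length, so pos() is the boundary offset itself.
while ($text =~ /$boundary/g) {
    push @positions, pos($text);
}

print "@positions\n";   # offsets of the four word edges
```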
-
-
- SEE ALSO
-
- the section on Regexp Quote-Like Operators in the perlop
- manpage.
-
- the section on Gory details of parsing quoted constructs in
- the perlop manpage.
-
- the pos entry in the perlfunc manpage.
-
- the perllocale manpage.
-
- Mastering Regular Expressions (see the perlbook manpage) by
- Jeffrey Friedl.